53 research outputs found
Single Cell Training on Architecture Search for Image Denoising
Neural Architecture Search (NAS) for automatically finding the optimal
network architecture has shown some success, with competitive performance in
various computer vision tasks. However, NAS in general requires a tremendous
amount of computation, so reducing computational cost has emerged as an
important issue. Most attempts so far have been based on manual approaches,
and the architectures developed from such efforts often strike an uneasy
balance between network optimality and search cost. Additionally, recent
NAS methods for image restoration generally do not consider dynamic operations
that may transform the dimensions of feature maps, because of the dimensionality
mismatch in tensor calculations. This can greatly limit NAS in its search for
an optimal network structure. To address these issues, we re-frame the optimal
search problem by focusing on the component block level. Previous work has
shown that an effective denoising block can be connected in series to
further improve network performance. By focusing on the block level, the search
space for reinforcement learning becomes significantly smaller and the evaluation
process can be conducted more rapidly. In addition, we integrate innovative
dimension matching modules for dealing with the spatial and channel-wise
mismatches that may occur in the optimal design search. This allows much greater
flexibility in the optimal network search within the cell block. With these
modules, we then employ reinforcement learning to search for an optimal image
denoising network at the module level. The computational efficiency of our
proposed Denoising Prior Neural Architecture Search (DPNAS) is demonstrated by
having it complete an optimal architecture search for an image restoration task
in just one day on a single GPU.
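The dimension matching the abstract describes can be illustrated with a minimal numpy sketch: channel mismatch is resolved by a 1x1-style projection and spatial mismatch by nearest-neighbour resampling. The function names and the random projection weights are hypothetical illustrations, not the paper's actual modules.

```python
import numpy as np

def match_channels(x, target_c, rng):
    """Project the channel dimension C -> target_c with a (hypothetical)
    learned 1x1 projection, here a random matrix for illustration."""
    c = x.shape[0]
    w = rng.standard_normal((target_c, c)) / np.sqrt(c)
    # einsum applies the projection at every spatial location
    return np.einsum('oc,chw->ohw', w, x)

def match_spatial(x, target_h, target_w):
    """Resize H x W feature maps by nearest-neighbour sampling."""
    c, h, w = x.shape
    rows = np.arange(target_h) * h // target_h
    cols = np.arange(target_w) * w // target_w
    return x[:, rows][:, :, cols]

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 32, 32))          # C=16, 32x32 feature map
y = match_spatial(match_channels(x, 64, rng), 16, 16)
print(y.shape)  # (64, 16, 16)
```

With both adapters in place, a searched block can feed any tensor shape into any candidate operation, which is the flexibility the abstract attributes to its dimension matching modules.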
Spatial-temporal Transformer-guided Diffusion based Data Augmentation for Efficient Skeleton-based Action Recognition
Recently, skeleton-based human action recognition has become a hot research
topic, because the compact representation of human skeletons has brought new
vitality to this research domain. As a result, researchers have begun to notice
the importance of analyzing human actions by extracting skeleton information
from RGB or other sensors. Leveraging the rapid development of deep learning
(DL), a significant number of skeleton-based human action recognition
approaches with carefully designed DL structures have been presented recently.
However, a well-trained DL model always demands high-quality and sufficient
data, which is hard to obtain without high expense and human labor. In this
paper, we introduce a novel data augmentation method for skeleton-based action
recognition tasks, which can effectively generate high-quality and diverse
sequential actions. To obtain natural and realistic action sequences, we
propose denoising diffusion probabilistic models (DDPMs) that generate a series
of synthetic action sequences, with a generation process precisely guided by a
spatial-temporal transformer (ST-Trans). Experimental results show that our
method outperforms state-of-the-art (SOTA) motion generation approaches on
various naturalness and diversity metrics, and that its high-quality synthetic
data can be effectively deployed to existing action recognition models with
significant performance improvements.
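The DDPM forward (noising) process the abstract builds on can be sketched in a few lines. This assumes the standard linear beta schedule; the paper's exact schedule and its ST-Trans guidance network are not reproduced here, and the "skeleton sequence" is a random stand-in.

```python
import numpy as np

rng = np.random.default_rng(0)

# standard DDPM linear noise schedule (an assumption, not the paper's)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def q_sample(x0, t, noise):
    """Forward process: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise

# toy "skeleton sequence": 30 frames x 17 joints x 3 coordinates
x0 = rng.standard_normal((30, 17, 3))
noise = rng.standard_normal(x0.shape)
x_T = q_sample(x0, T - 1, noise)

# at t = T the sample is almost pure noise, as the schedule intends
corr = np.corrcoef(x_T.ravel(), noise.ravel())[0, 1]
print(round(corr, 3))
```

Generation then runs this process in reverse, with a trained network (here, the ST-Trans) predicting the noise to remove at each step.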
A New Feature Normalization Scheme Based on Eigenspace for Noisy Speech Recognition
We propose a new feature normalization scheme based on eigenspace for achieving robust speech recognition. In particular, we apply Mean and Variance Normalization (MVN) in eigenspace, using unique and independent eigenspaces for the cepstra, delta cepstra, and delta-delta cepstra, respectively. We also normalize the training data in eigenspace and train the model from the normalized training data. In addition, a feature space rotation procedure is introduced to reduce the mismatch between the training and test data distributions in noisy conditions. As a result, we obtain a substantial recognition improvement over basic eigenspace normalization.
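The core operation, MVN in an eigenspace, can be sketched with numpy: rotate features into the eigenspace of the training covariance, then normalize each eigen-dimension to zero mean and unit variance. This is a simplified illustration of the scheme; the paper uses separate eigenspaces per feature stream and an additional rotation step, neither of which is modeled here.

```python
import numpy as np

def eigenspace_mvn(train, test):
    """MVN in the eigenspace of the training data.

    train, test: (num_frames, num_coeffs) cepstral feature matrices.
    Returns both sets rotated into the training eigenspace and
    mean/variance-normalized per eigen-dimension (a sketch only)."""
    mu = train.mean(axis=0)
    cov = np.cov(train - mu, rowvar=False)
    _, vecs = np.linalg.eigh(cov)              # eigenvectors of training data

    def project_and_normalize(x):
        z = (x - mu) @ vecs                    # rotate into eigenspace
        return (z - z.mean(axis=0)) / z.std(axis=0)   # MVN per dimension

    return project_and_normalize(train), project_and_normalize(test)

rng = np.random.default_rng(0)
train = rng.standard_normal((200, 13)) @ rng.standard_normal((13, 13))
test = train[:50] + 0.1 * rng.standard_normal((50, 13))   # "noisy" copy
tr_n, te_n = eigenspace_mvn(train, test)
print(tr_n.shape, te_n.shape)
```

After normalization every eigen-dimension of the training features has zero mean and unit variance, which is the invariance the scheme relies on for noise robustness.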
Topological mappings of video and audio data
We review a new form of self-organizing map which is based on a nonlinear projection of latent points into data space, identical to that performed in the Generative Topographic Mapping (GTM). But whereas the GTM is an extension of a mixture of experts, this model is an extension of a product of experts. We show visualisation and clustering results on a data set composed of video data of lips uttering 5 Korean vowels. Finally, we note that we may dispense with the probabilistic underpinnings of the product of experts and derive the same algorithm as a minimisation of the mean squared error between the prototypes and the data. This leads us to suggest a new algorithm which incorporates local and global information in the clustering. Both of the new algorithms achieve better results than the standard Self-Organizing Map.
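The mean-squared-error view mentioned above can be sketched as a batch self-organizing map: each point is assigned to its best-matching prototype, and prototypes are updated as neighbourhood-weighted means, minimising squared error smoothed over the lattice. This is a generic SOM sketch under assumed parameters, not the GTM/product-of-experts model from the paper.

```python
import numpy as np

def batch_som(data, n_units=5, n_iter=20, sigma=1.0, rng=None):
    """Minimal batch SOM on a 1-D lattice: prototypes minimise mean squared
    error to the data, smoothed by a Gaussian neighbourhood on the lattice."""
    rng = rng or np.random.default_rng(0)
    protos = data[rng.choice(len(data), n_units, replace=False)].astype(float)
    grid = np.arange(n_units)
    for _ in range(n_iter):
        # best-matching unit (nearest prototype) for every data point
        d2 = ((data[:, None, :] - protos[None]) ** 2).sum(-1)
        bmu = d2.argmin(1)
        # Gaussian neighbourhood weights between lattice positions
        h = np.exp(-0.5 * ((grid[:, None] - grid[bmu][None]) / sigma) ** 2)
        # neighbourhood-weighted mean update of each prototype
        protos = (h @ data) / h.sum(1, keepdims=True)
    return protos, bmu

rng = np.random.default_rng(1)
data = np.vstack([rng.normal(c, 0.2, (40, 2)) for c in (-2, 0, 2)])
protos, bmu = batch_som(data, n_units=5, rng=rng)
print(protos.round(2))
```

The neighbourhood term is what distinguishes this from plain k-means: it couples nearby lattice units, giving the topology-preserving behaviour the abstract compares against.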